Abstract-Inspired by biology, this paper investigates a voice
processing method based on the central auditory system of the
human brain. An integrated voice separation model is
established by simulating the central auditory system. Voice
signals are decomposed by multi-band spectral analysis in a
peripheral auditory model. A coincidence neuron model then
extracts features from the voice signals, and the voice is
separated in a cell model of the inferior colliculus.
Experimental results show that the integrated voice
separation model can reliably separate voices in a
multi-source environment.
I. INTRODUCTION

Multi-source speech recognition has been studied worldwide
for many years, and three approaches have become popular. The
first, proposed by Rodemann et al., applies three cues for
sound localization, but it does not take advantage of the
biological connections between the superior olivary complex
(SOC) and the inferior colliculus (IC) [1]. The second,
proposed by Voutsas and Adamy, is a multi delay-line model
built from spiking neural networks (SNN) that incorporate
realistic neuronal models. Their model takes only the binaural
time difference (ITD) into account and is not effective for
frequencies above 1.5 kHz [2]. The third, proposed by Willert
and Nixim, builds a probabilistic model to estimate the
position of the sound sources and uses Bayes' theorem to
calculate the connections between them. However, this method
does not use spiking neural networks to simulate realistic
neuronal processing [3].

All the approaches mentioned above separate signals well in a
specific laboratory environment. However, their performance
degrades greatly in a multi-source speech environment, since
the models do not completely simulate the role of the central
auditory system in voice separation. During the past
twenty-five years, the study of the auditory central nervous
system has made significant progress and has shown that the
inferior colliculus (IC) plays a crucial role in acquiring
auditory information. The IC is a hub and processing centre
for extracting sound features [5]. Here, the binaural time
difference (ITD) and level difference (ILD) are both extracted
from the sound.

Research on hearing shows that the two ears identify features
from the difference in distance between the source and each
ear and from the difference in masking along the sound
transmission path. A sound arriving from a given place
therefore reaches the two ears with a binaural time difference
and a binaural level difference. As shown in Fig. 1, the
travel time from the sound source to each ear is different.
When the input speech is separated by the auditory nerve
centre, the time difference and level difference are an
important basis for sound separation [6].

Fig. 1 Binaural time difference (ITD) between two ears (S: sound source)

The IC controls the response threshold of the auditory cilia
of the inner ear nerve. Low-frequency (< 1.5 kHz) voice
signals pass through the central area of the medial superior
olive (MSO) to the IC; in this frequency range, ITD is the
more efficient cue for voice separation. High-frequency
(> 1.5 kHz) voice signals pass through the central areas of
both the medial superior olive (MSO) and the lateral superior
olive (LSO) to the IC; in this frequency range, ILD is the
more efficient cue. Finally, the signals from the different
areas enter the IC separately [7].
Fig. 2 MSO and LSO analysis of the speech signal spectrum

Another important feature of the nerve tissue of the IC is its
multi-layer anatomical structure, which decomposes sound
signals according to frequency. Each layer of nerve cells
responds only to specific frequency components. This
anatomical feature is called the frequency anatomical feature,
and it isolates multi-band input audio spatially in the IC
[8]. Thus, sounds from the same source or with the same
frequency characteristics can easily be brought into
coincidence and extracted. In a noisy multi-source
environment, the speech signal is then separated and extracted
to regenerate the signal flows [9].

In summary, the auditory central nervous system can
effectively separate noisy multi-source input, and a complete
voice separation model that simulates the auditory central
nervous system may solve the voice recognition problem in a
dynamic and complex environment.

II. THE SPEECH SEPARATION MODEL BASED ON CENTRAL 
AUDITORY SYSTEM 

Fig. 3 is the schematic diagram of the separation system based
on the auditory central nervous system, which is a complete
computational model simulating the central auditory system.
Multi-channel audio signals are first divided into frequency
channels as they pass through the peripheral auditory model.
Voice information is then extracted as the signals pass
through the superior olivary complex (SOC, including MSO and
LSO). Finally, the multiple sound sources are separated into
individual voice signals in the IC cells [10].
Fig. 3 The speech separation model based on central auditory system

A. Auditory Peripheral Models 

Acoustic studies have shown that the human ear responds
differently to information at different frequencies. The
basilar membrane located inside the cochlea is an important
link to the central auditory system and performs frequency
decomposition. Signals of different frequencies stimulate
vibrations at different locations of the basilar membrane
[11].

Based on these features of the basilar membrane, the
peripheral auditory processing uses a bank of 16 second-order
discrete Gammatone filters for multi-frequency analysis. The
centre frequencies range from 100 Hz to 4 kHz, and the mixed
left-ear and right-ear signals are decomposed separately, time
frame by time frame. Each second-order discrete filter has the
z-domain form,



$$
F_i(z) = \left( \frac{T z}{z^2 - 2 e^{-b_i T}\cos(\omega_i T)\, z + e^{-2 b_i T}} \right)^{4}
\prod_{j,k=1}^{2} \left( z - e^{-b_i T}\cos(\omega_i T) + (-1)^k \sqrt{3 + (-1)^j\, 2^{1.5}}\, \sin(\omega_i T) \right) \qquad (1)
$$



where i is the index of the filter in the group, $\omega_i$
is the characteristic frequency of the i-th filter, and $b_i$
is the fixed bandwidth of that frequency channel.

Fig. 4 shows the frequency responses of the Gammatone filters
in the auditory peripheral model.

Fig. 4 The frequency response of filter group consisting of Gammatone filters (horizontal axis: Frequency [Hz])

In Fig. 4, the filter group consists of 16 Gammatone filters
covering the frequency range from 100 Hz to 4 kHz. After
multi-frequency analysis by the auditory peripheral model, the
input speech signal passes into the central auditory system in
16 frequency channels, one per filter.
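The centre frequencies of the 16 channels are not listed in this
section. As a minimal sketch, assuming they are spaced uniformly on
the ERB-rate scale of Glasberg and Moore between 100 Hz and 4 kHz (a
common choice for Gammatone filter banks, not something stated in this
paper), the channel layout and a sampled Gammatone impulse response
could be generated as follows. All function names, the filter order 4
and the 1.019 bandwidth factor are illustrative assumptions.

```python
import math


def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore formula)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)


def hz_to_erb_rate(f_hz):
    """Map a frequency in Hz to the ERB-rate scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)


def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37


def centre_frequencies(n_channels=16, f_low=100.0, f_high=4000.0):
    """Centre frequencies spaced uniformly on the ERB-rate scale (assumed)."""
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + k * step) for k in range(n_channels)]


def gammatone_impulse_response(fc_hz, fs_hz, duration_s=0.025, order=4):
    """Sampled g(t) = t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    b = 1.019 * erb_bandwidth(fc_hz)  # assumed bandwidth scaling
    n_samples = int(duration_s * fs_hz)
    response = []
    for k in range(n_samples):
        t = k / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        response.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    return response
```

Convolving each ear signal with the 16 impulse responses would give
the 16 frequency channels described above; this time-domain form is a
stand-in for the z-domain filter of equation (1), not a reproduction
of it.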

B. Coincidence Neuron Model 

The coincidence neuron model imitates the responses of
synapses and neurons in order to extract and integrate the
audio information.

1) Generic Synaptic Model 

The voice signal (wave) is relayed by neurotransmitters as it
passes the inner hair cells of the auditory system. The
vibration that the wave causes on the basilar membrane
releases neurotransmitter through a permeable membrane into
the synaptic cleft, and the movement of the hair brings the
auditory nerve to fire. The permeability of the membrane,
h(t), changes with the amplitude of the input signal. Each
Gammatone filter output goes through half-wave rectification
(HWR).



$$
h(t) = \begin{cases}
\dfrac{g\,dt\,[x(t) + A]}{x(t) + A + B}, & [x(t) + A] > 0 \\[4pt]
0, & [x(t) + A] \le 0
\end{cases} \qquad (2)
$$



where A is the permeation threshold for the signal x(t), g is
a constant related to the permeability, dt is the sampling
interval, and B is associated with the maximum permeability.

The model assumes that the inner hair cell manufactures
transmitter substance. The amount of transmitter substance in
the free store lying close to the membrane is q(t), and it is
replenished at a rate y[1-q(t)]. The amount of transmitter
substance in the synaptic cleft is c(t), of which rc(t) is
taken back up into the hair cell. At the same time, some of it
is lost from the cleft, and from the system altogether, at a
rate lc(t). This is summarized as follows.

$$ \frac{dq}{dt} = y[1 - q(t)] + r\,c(t) - h(t)\,q(t) \qquad (3) $$

$$ \frac{dc}{dt} = h(t)\,q(t) - l\,c(t) - r\,c(t) \qquad (4) $$

Then the auditory-nerve firing rate is:

$$ p(t) = h\,c(t)\,dt \qquad (5) $$



Equations (3), (4) and (5) constitute the inner hair cell and
synapse model, where g, y, r, l and h are constants with
suitable values and dt is the sampling interval.
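The hair cell and synapse equations can be stepped forward in time
with a simple Euler scheme. The sketch below is one possible
discretisation, not the authors' implementation. The constant values
are illustrative assumptions chosen only so that the example runs
stably at 44.1 kHz, and the function name `hair_cell_firing` is
hypothetical.

```python
def hair_cell_firing(x, dt, g=2000.0, y=5.05, r=6580.0, l=2500.0,
                     h=50000.0, A=5.0, B=300.0):
    """Euler integration of equations (2) to (5).

    x  : list of half-wave rectified filter output samples
    dt : sampling interval in seconds
    All constants are assumed values for illustration only.
    Returns the firing rate p(t) for every sample.
    """
    q = 1.0  # transmitter in the free store (starts full)
    c = 0.0  # transmitter in the synaptic cleft (starts empty)
    rates = []
    for sample in x:
        drive = sample + A
        # Equation (2): permeability, already scaled by dt.
        k = g * dt * drive / (drive + B) if drive > 0.0 else 0.0
        ejected = k * q                  # h(t) q(t)
        replenished = y * dt * (1.0 - q)  # y [1 - q(t)]
        reuptake = r * dt * c            # r c(t)
        lost = l * dt * c                # l c(t)
        q += replenished + reuptake - ejected   # equation (3)
        c += ejected - lost - reuptake          # equation (4)
        rates.append(h * c * dt)                # equation (5)
    return rates
```

With these assumed constants, a loud constant input produces a strong
onset response that then adapts as the free store q(t) is depleted,
which is the qualitative behaviour the model is meant to capture.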



2) Generic Soma Model



The transmitter molecules diffuse to the postsynaptic 
neuron through the synaptic cleft. The decay of the transmitter 
concentration is simulated by a leaky integrate-and-fire model. 
The amount of transmitter molecules on the postsynaptic 
neuron changes its permeability to certain ions. Ionic channels 
thus open gradually, receiving even more ions and forming a 
current. This procedure is modelled using a conductance 
mechanism which forms a gradually increasing postsynaptic 
current. 

The leaky integrate-and-fire model, which has been proposed as
a neuron model, can be used for processing time-varying
signals [12] in powerful computing systems. It consists of a
resistor R in parallel with a capacitor C, which is charged by
a current I(t) to simulate the postsynaptic current charging
the ITD coincidence model. The voltage U(t) across the
capacitor C is compared with a threshold $\varphi$. The
schematic diagram of the leaky integrate-and-fire model is
shown in Fig. 5.



Fig. 5 Schematic of the leaky integrate-and-fire model

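Between spikes, the voltage of the leaky integrate-and-fire
model is given by:

$$
U(t) = U_r \exp\!\left(-\frac{t - t_0}{RC}\right)
+ \frac{1}{C}\int_0^{t - t_0} \exp\!\left(-\frac{s}{RC}\right) I(t - s)\,ds \qquad (6)
$$

where $U_r$ is the initial membrane potential, $t_0$ is the
time at which the potential was last set to $U_r$, and a
typical value for RC is 1.6 ms. When $U(t) = \varphi$ at time
t, an output spike $\theta(t)$ is generated, and U(t) is then
reset to the initial voltage $U_r$.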
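Equation (6) is the solution of the membrane equation
C dU/dt = -U/R + I(t), so the soma can be simulated by stepping that
differential equation and resetting at threshold. The following is a
minimal forward-Euler sketch under that reading; the function name and
the default values of C, the threshold and the reset voltage are
assumptions, and only RC = 1.6 ms comes from the text.

```python
def lif_spike_times(current, dt, rc=1.6e-3, capacitance=1.0,
                    threshold=1.0, u_reset=0.0):
    """Simulate a leaky integrate-and-fire neuron.

    current     : list of input current samples I(t)
    dt          : time step in seconds
    rc          : membrane time constant RC (1.6 ms as in the text)
    capacitance : C (assumed value)
    threshold   : firing threshold (assumed value)
    u_reset     : reset voltage U_r (assumed value)
    Returns the list of spike times in seconds.
    """
    u = u_reset
    spikes = []
    for n, i_n in enumerate(current):
        # Forward-Euler step of dU/dt = -U/(RC) + I(t)/C.
        u += dt * (-u / rc + i_n / capacitance)
        if u >= threshold:
            spikes.append(n * dt)
            u = u_reset  # reset to the initial voltage after a spike
    return spikes
```

For a constant current the voltage settles at I*RC/C, so with these
assumed values a current of 500 never reaches the threshold while a
current of 1000 produces a regular spike train.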

3) Coincidence Neuron Model 

In Fig. 6, in the ITD pathway, the spike sequence of the
contralateral ear passes through a variable delay line
$\Delta t_i$. The delayed spike sequence is denoted
$S_C(\Delta t_i, f_j)$, where C stands for contralateral,
$\Delta t_i$ for the delay time and $f_j$ for frequency
channel j. Similarly, $S_I(\Delta T, f_j)$ represents the
delayed spike sequence of the ipsilateral ear with a fixed
delay time $\Delta T$. $S_C(\Delta t_i, f_j)$ and
$S_I(\Delta T, f_j)$ are then input to the ITD coincidence
model to calculate the ITD. The output of the ITD coincidence
model is a new spike sequence
$S_{ITD}((\Delta T - \Delta t_i), f_j)$. If there are spikes
in $S_{ITD}((\Delta T - \Delta t_i), f_j)$, the sound arrived
at the ipsilateral ear earlier than at the contralateral ear
by ITD = $\Delta T - \Delta t_i$ seconds. A negative ITD value
means the sound arrives at the right ear later than at the
left ear. In Fig. 6, ES stands for excitatory synapse.
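The delay-line logic can be illustrated with spike times expressed as
integer sample indices. The sketch below replaces the leaky
integrate-and-fire coincidence cell with a plain count of exactly
coinciding spikes, which is a deliberate simplification, and the
function names are hypothetical.

```python
def itd_spike_counts(ipsi, contra, fixed_delay, variable_delays):
    """Count coincidences for every candidate ITD in one frequency channel.

    ipsi, contra    : spike times (integer sample indices) of the two ears
    fixed_delay     : fixed delay applied to the ipsilateral spikes
    variable_delays : candidate delays applied to the contralateral spikes
    Returns a dict mapping ITD = fixed_delay - delay to a coincidence count.
    """
    delayed_ipsi = {t + fixed_delay for t in ipsi}
    counts = {}
    for delay in variable_delays:
        delayed_contra = {t + delay for t in contra}
        # A coincidence is a delayed spike present in both sequences.
        counts[fixed_delay - delay] = len(delayed_ipsi & delayed_contra)
    return counts


def estimate_itd(counts):
    """Return the ITD (in samples) with the most coincidences."""
    return max(counts, key=counts.get)
```

For example, if every contralateral spike lags its ipsilateral
counterpart by 3 samples, the only delay pair that lines the two
sequences up satisfies fixed_delay - delay = 3, so the estimated ITD is
3 samples, meaning the sound reached the ipsilateral ear first.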
Fig. 6 ITD coincidence model

Fig. 7 ILD coincidence model

In Fig. 7, the ILD pathway is modelled without the leaky
integrate-and-fire model. Instead, the detected sound levels
of the two sides are compared to calculate the level
difference, and the corresponding ILD cell fires a spike. The
level difference is calculated as

$$ \Delta p_j = \log\!\left(\frac{p_j^{I}}{p_j^{C}}\right) $$

where $p_j^{I}$ and $p_j^{C}$ stand for the ipsilateral and
contralateral sound levels of frequency channel j,
respectively. Similar to the ITD model, the output of the ILD
model can be represented as a 2D spiking map,
$S_{ILD}(\Delta p_j, f_j)$. A negative ILD value means the
level of the sound reaching the right ear is lower than that
reaching the left ear. In Fig. 7, ipsi and contra stand for
the ipsilateral and contralateral Gammatone frequency
channels.
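A 2D ILD spiking map of this kind can be sketched as follows. The
base of the logarithm and the set of level-difference values
represented by the ILD cells are not given in the text, so base 10 and
the `ild_bins` argument are assumptions, as is the function name.

```python
import math


def ild_spike_map(ipsi_levels, contra_levels, ild_bins):
    """Build a 2D spiking map with one row per ILD cell, one column per channel.

    ipsi_levels, contra_levels : positive sound levels per frequency channel
    ild_bins                   : level differences represented by the ILD cells
    For every channel, the ILD cell nearest to log10(ipsi / contra) fires.
    """
    n_channels = len(ipsi_levels)
    spike_map = [[0] * n_channels for _ in ild_bins]
    for j in range(n_channels):
        delta_p = math.log10(ipsi_levels[j] / contra_levels[j])
        nearest = min(range(len(ild_bins)),
                      key=lambda b: abs(ild_bins[b] - delta_p))
        spike_map[nearest][j] = 1
    return spike_map
```

A channel that is ten times louder on the ipsilateral side fires the
cell for a level difference of +1, an equally loud channel fires the
cell for 0, and a channel ten times louder on the contralateral side
fires the cell for -1.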

Before merging the ITD and ILD maps for broadband sound
separation, we must consider the advantages of ITD and ILD in
different frequency bands. On the one hand, ITD accurately
extracts features from voice signals up to 1.5 kHz but becomes
ambiguous above 1.5 kHz. On the other hand, ILD is negligible
below 1.5 kHz. We therefore build two weight arrays over
frequency, $ITD_w$ and $ILD_w$, and compute the weighted ITD
or ILD map by multiplying the weight array with the matrix of
the 2D ITD/ILD map.
$$ ITD_w(j) = \frac{\cdots}{\operatorname{sum}\left(\max(f_j/500,\ 1)\right)} \qquad (7) $$

$$ ILD_w(j) = \frac{\max\left(\log(f_j/300),\ 0\right)}{\operatorname{sum}\left(\max(\log(f_j/300),\ 0)\right)} \qquad (8) $$



Here j is the channel index. The weighted ITD and ILD maps,
which are the outputs of the MSO and LSO, are finally
integrated into a single map and passed to the nerve cells of
the inferior colliculus for voice information extraction and
separation.

C. IC Cell Model 

The cells in the IC can be classified into 6 physiological
types, including Sustained-regular, Rebound-regular, Onset,
and so on. Of these 6 types, we tested the Onset cell, which
is simulated in this paper in order to separate the speech
signal. Fig. 8 shows the structure of the Onset cell of the
IC.



Fig. 8 The IC's onset cell (inputs: from SOC and others; one output)

In general, the Onset cell has two states: active and
inactive. In Fig. 8, while the cell is active it is
implemented as a LIF neuron until a spike is fired or the cell
receives an inhibitory input, after which the cell becomes
inactive. The cell returns to the active state only when there
is no inhibition and the input has been silent (no spike) for
a period of $t_s$.

The Onset cell model is re-used for multi-source speech signal
separation. First, for the i-th frequency channel and the j-th
time frame of the nerve cell model, the speech signal energy
$\sum_t s_{i,j}^2(t)$ and the noise signal energy
$\sum_t n_{i,j}^2(t)$ are calculated. Then the signal energy
ratio is calculated as follows.

$$ e_{i,j} = \frac{\sum_t s_{i,j}^2(t)}{\sum_t s_{i,j}^2(t) + \sum_t n_{i,j}^2(t)} \qquad (9) $$

When $e_{i,j} > 0.5$, the voice energy is greater than the
noise energy and the voice should be retained. When
$e_{i,j} < 0.5$, the voice should be discarded. The Onset cell
model is then re-used to obtain the ITD and ILD values and
build a masking matrix for separating the voice signal. The
binary masking coefficient for the i-th frequency channel and
the j-th time frame is defined as,



$$
M(i,j) = \begin{cases}
1, & \text{if } (f_i \le f_c) \text{ and } [T_{max}(i,j) > \tau_{ITD}(i,j)] \\
1, & \text{if } (f_i > f_c) \text{ and } [L(i,j) > \tau_{ILD}(i,j)] \\
0, & \text{else}
\end{cases} \qquad (10)
$$

where $f_c$ = 1.5 kHz, and $\tau_{ITD}(i,j)$ and
$\tau_{ILD}(i,j)$ are the thresholds for ITD and ILD,
respectively. $T_{max}(i,j)$ is the maximum delay of the i-th
frequency channel and the j-th time frame. L(i, j) is the ILD
value of the i-th frequency channel and the j-th time frame.



$$ L(i,j) = 20 \lg \frac{\sum_t P_l(i,j,t)^2}{\sum_t P_r(i,j,t)^2} \qquad (11) $$

where $P_l(i,j,t)$ and $P_r(i,j,t)$ are the signal firing
rates of the left and right ears for the i-th channel and the
j-th time frame.
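The energy ratio and the binary mask can be computed directly from
their definitions. The sketch below assumes the per-unit delays, level
differences and thresholds are already available as nested lists
indexed by channel and frame; the function and argument names are
hypothetical.

```python
def energy_ratio(speech_frame, noise_frame):
    """Energy ratio of one time-frequency unit, as in equation (9)."""
    speech_energy = sum(v * v for v in speech_frame)
    noise_energy = sum(v * v for v in noise_frame)
    total = speech_energy + noise_energy
    return speech_energy / total if total > 0.0 else 0.0


def binary_mask(centre_freqs, t_max, ild, tau_itd, tau_ild, f_c=1500.0):
    """Binary mask of equation (10).

    centre_freqs     : centre frequency of each channel i in Hz
    t_max, ild       : matrices [channel][frame] of max delay and ILD
    tau_itd, tau_ild : matrices [channel][frame] of thresholds
    f_c              : crossover frequency (1.5 kHz in the text)
    """
    mask = []
    for i, f_i in enumerate(centre_freqs):
        row = []
        for j in range(len(t_max[i])):
            if f_i <= f_c:
                keep = t_max[i][j] > tau_itd[i][j]   # low band: ITD cue
            else:
                keep = ild[i][j] > tau_ild[i][j]     # high band: ILD cue
            row.append(1 if keep else 0)
        mask.append(row)
    return mask
```

Multiplying each time-frequency unit of the mixture by the
corresponding mask element keeps the units attributed to the target
source and zeroes the rest.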

The masking coefficients are calculated for every frequency
channel and time frame of the multi-source speech signal,
which yields the masking matrix. Matrix elements 1 and 0
indicate whether or not a time-frequency unit belongs to the
same source, so all elements equal to 1 belong to the same
source. The Fourier transform of the autocorrelation of a
signal is equal to the square of its Fourier transform
magnitude. Let $R_x(\tau)$ be the autocorrelation of x(t).
Then the power spectrum of x(t) is,

$$ |X(\omega)|^2 = \int_{-\infty}^{\infty} R_x(\tau)\, e^{-j\omega\tau}\, d\tau \qquad (12) $$



Therefore, by simple operations, we can obtain the magnitude
of the short-time Fourier transform (STFT). The remaining
problem is how to reconstruct a signal from the STFT magnitude
alone. To achieve this, an iterative algorithm re-estimates
the phase of the signal at each iteration so as to decrease
the squared error between the STFT magnitude of the
reconstructed signal and the known STFT magnitude.

The reconstructed signal after i iterations is,

$$
x^{(i)}(n) = \frac{\sum_{m} w(mS - n)\, \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{X}^{(i-1)}(m,\omega)\, e^{j\omega n}\, d\omega}{\sum_{m} w^2(mS - n)} \qquad (13)
$$

where w(mS - n) is the analysis window and S is the window
shift.

Then we can obtain the STFT $X^{(i)}(m,\omega)$ of the i-th
reconstructed signal $x^{(i)}(n)$ and compute the error with
respect to the given short-time magnitude $|X_d(m,\omega)|$
as follows,

$$ \sum_{m} \sum_{\omega} \left( |X^{(i)}(m,\omega)| - |X_d(m,\omega)| \right)^2 \qquad (14) $$

If the error is smaller than the given value, the iteration
ends. Otherwise, we calculate $\hat{X}^{(i)}(m,\omega)$ and
run the (i+1)-th iteration, where $\hat{X}^{(i)}(m,\omega)$ is
as follows.

$$ \hat{X}^{(i)}(m,\omega) = |X_d(m,\omega)|\, \frac{X^{(i)}(m,\omega)}{|X^{(i)}(m,\omega)|} \qquad (15) $$



Thus the auditory-nerve firing rate p(t) of each channel of
the auditory model can be calculated from the correlogram. The
next step is to recover the signal h(t) coming out of the HWR
stage from p(t). From (5),

$$ c(t) = \frac{p(t)}{h\,dt} \qquad (16) $$

Then q(t) and h(t) can be obtained as follows.

$$ q(t) = y[1 - q(t-1)]\,dt - l\,c(t-1)\,dt - [c(t) - c(t-1)] + q(t-1) \qquad (17) $$

$$ h(t) = \frac{\dfrac{c(t) - c(t-1)}{dt} + l\,c(t) + r\,c(t)}{q(t)} \qquad (18) $$

where h(t) is the signal coming out of the HWR stage.

The HWR stage is then inverted by reconstructing the negative
parts of the signal through further iteration.

III. EXPERIMENTAL RESULTS 

To verify the correctness and effectiveness of the above
model, many experiments were carried out. The Onset cell model
was tested on a PC with a 2.5 GHz Intel Pentium processor and
1 GBits of memory, using Matlab-Simulink at a sampling rate of
44.1 kHz with 16-bit sampling precision. Three types of test
data are summarized in this paper: class A, class B and class
C. Each set of data includes voice signals and noise signals
(or other interfering voices).

In class A, source one consists of Chinese words and source
two of traffic noise. In class B, source one consists of
Chinese phrases and source two of traffic noise. In class C,
source one consists of a Chinese word spoken by a female voice
and source two of a Chinese word spoken by a male voice.
Fig. 9 The signal waveforms of simulation results in the third group



Fig. 9 compares the signal waveforms from the simulation of
group C. The first row is the original female voice saying
"forward". The second row is the original male voice saying
"forward". The third row is the mixed (aliased) signal. The
fourth row is the separated source signal of the male voice
"forward". The fifth row is the separated source signal of the
female voice "forward".

For the three types of tests, class A, class B and class C, a
large number of repeated tests were carried out. For this
paper, the results of 20 test groups from each class were
randomly selected, and the separated voice signal was compared
with the original speech waveform in Matlab.
Fig. 10 Similarity curve table of the first group

Fig. 11 Similarity curve table of the second group

Fig. 12 Similarity curve table of the third group
Fig. 10, Fig. 11 and Fig. 12 present the results of the
similarity comparison for the three groups. The abscissa is
the test number; the vertical axis is the similarity between
the separated and the original voice signal. As the graphs
show, the similarity between the separated voice signal and
the original signal is more than 0.97. The method is therefore
highly robust for the separation of speech signals.

Recently, Voutsas and Adamy built a multi delay-line model
using spiking neural networks incorporating realistic neuronal
models. Their model takes only ITD into account. Its average
similarity reaches 0.975 when the sound frequency is below
1.5 kHz, but it is not effective for frequencies above
1.5 kHz. Fig. 13 presents the similarity results for the model
of Voutsas and Adamy. Tests 1 to 25 apply the model of Voutsas
and Adamy to frequencies below 1.5 kHz, and tests 25 to 50
apply it to frequencies above 1.5 kHz.
Fig. 13 Similarity curve table of the model of Voutsas and Adamy

Fig. 13 shows that the model of Voutsas and Adamy is usable
only for sound frequencies below 1.5 kHz, whereas the model
proposed in this paper can also be used at frequencies above
this range.

For the experiments of class A and class B, this approach
improves the signal-to-noise ratio of the speech. When
integrating the ITD and ILD information, the coincidence
neuron model selects five of the azimuth data. The
signal-to-noise ratio is calculated according to:

$$ SNR\,(dB) = 10 \lg \frac{\sum_t s(t)^2}{\sum_t [s(t) - \hat{s}(t)]^2} \qquad (19) $$
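As a worked illustration of this signal-to-noise ratio, the sketch
below assumes that s(t) is the original clean signal and that the
second term is the signal being evaluated (the mixture before
separation, or the separated output after it); the text does not
define the two symbols explicitly, and the function name is
hypothetical.

```python
import math


def snr_db(original, estimate):
    """Signal-to-noise ratio in dB between a clean signal and an estimate."""
    signal_energy = sum(s * s for s in original)
    error_energy = sum((s - e) ** 2 for s, e in zip(original, estimate))
    if error_energy == 0.0:
        return math.inf  # the estimate matches the original exactly
    return 10.0 * math.log10(signal_energy / error_energy)
```

For instance, an estimate whose samples are all 10% smaller than the
original has an error energy of one hundredth of the signal energy,
which corresponds to an SNR of about 20 dB.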



Table 1 compares the results, in which "Before" means "Before
separation" and "After" means "After separation".

TABLE 1 THE CONTRAST OF SIGNAL TO NOISE RATIO (dB)

Azimuth      First group          Second group
             Before    After      Before    After
  , 30       15.8      50.2       13.7      48.2
  , 60       14.6      49.7       12.9      47.8
45 , 90      14.5      49.1       12.8      48.1



As we can see, when the two sound sources arrive with a
sufficient difference in spatial orientation, the
signal-to-noise ratio after separation is greatly improved.
When the difference in incident spatial orientation of the two
sound sources is small, the signal-to-noise ratio after
separation differs little from that before separation. For
example, when the azimuth pair $(\theta_1, \theta_2)$ is
chosen as (135, 140), the ITD and ILD information calculated
by the coincidence neuron model is likely to be biased, which
leads to errors in the masking coefficients. This behaviour is
also consistent with human hearing: when two sounds come from
very similar azimuths, the human auditory system has
difficulty distinguishing them.

IV. CONCLUSION AND FUTURE WORK 

This paper proposes a sound separation method for environments
with multiple voice sources. A complete model of the auditory
central nervous system of the human brain was established.
Most existing speech recognition methods can only be used in
an environment with a single sound source and low noise; the
proposed model is a good solution to this problem.

In further research, the central auditory system model can
first be applied to the speech recognition of mobile robots to
achieve a high voice recognition rate. Second, the model can
be used in hearing aids for hearing-impaired people. Third,
the model can assist current text retrieval. Fourth, the
biological model can be used in sound field reconstruction and
virtual-reality environments.